Bayesian Logistic Regression for Predicting Diabetes Risk Using NHANES 2013–2014 Data

A Capstone Project on Bayesian Applications in Epidemiologic Modeling

Authors

Namita Mishra

Autumn Wilcox

Published

November 1, 2025

Slides: slides.html

Introduction

Diabetes mellitus (DM) remains a major public health challenge, and identifying key risk factors—such as obesity, age, sex, and race/ethnicity—is essential for prevention and targeted intervention. Logistic regression is widely used to estimate associations between such factors and binary outcomes like diabetes diagnosis. However, classical maximum likelihood estimation (MLE) can produce unstable estimates in the presence of missing data, quasi-separation, or small samples. Bayesian logistic regression offers a robust alternative by integrating prior information, regularizing estimates, and quantifying uncertainty more transparently than frequentist approaches.

This study applies Bayesian logistic regression to estimate the risk of doctor-diagnosed diabetes among adults in the 2013–2014 National Health and Nutrition Examination Survey (NHANES). Predictors include age, body mass index (BMI), sex, and a coarsened race/ethnicity factor (race3). The NHANES race/ethnicity source variable (RIDRETH1) has five levels: Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, and Other/Multi. For analysis, we collapsed these into four groups (White, Black, Hispanic, and Other), combining the two Hispanic levels, to ensure adequate representation and model stability.

Three analytic frameworks were compared: (1) survey-weighted maximum likelihood estimation (MLE) using the NHANES complex design, (2) multiple imputation (MICE) with predictive mean matching and Rubin’s rules, and (3) Bayesian inference with weakly informative priors \(N(0, 2.5)\) implemented via brms. The Bayesian model incorporated normalized NHANES exam weights as importance weights, approximating design-based inference. Across all methods, age and BMI were positively associated with diabetes odds, female sex tended to have lower odds than male, and Black and Hispanic adults showed higher odds relative to White. Agreement across modeling frameworks supports the robustness of these associations and highlights the interpretability and uncertainty quantification advantages offered by Bayesian analysis for population health modeling.

Bayesian hierarchical models, implemented via Markov Chain Monte Carlo (MCMC), have been successfully applied in predicting patient health status across diseases such as pneumonia, prostate cancer, and mental disorders (Zeger et al. 2020). By representing predictive uncertainty alongside point estimates, Bayesian inference offers a practical advantage in epidemiologic modeling where decisions hinge on probabilistic thresholds. Beyond stability, Bayesian methods support model checking, variable selection, and uncertainty quantification under missingness or imputation frameworks (Baldwin and Larson 2017; Kruschke and Liddell 2017).

Recent work has expanded Bayesian applications to disease diagnostics and health risk modeling. For instance, Bayesian approaches have been used to evaluate NHANES diagnostic data (Chatzimichail and Hatjimihail 2023), to model cardiovascular and metabolic risk (Liu et al. 2013), and to integrate multiple data modalities such as imaging and laboratory measures (Abdullah, Hassan, and Mustafa 2022). Moreover, multiple imputation combined with Bayesian modeling generates robust estimates when data are missing at random (MAR) or not at random (MNAR) (Austin et al. 2021).

The broader Bayesian literature emphasizes the role of priors and model checking. Weakly informative priors, such as \(N(0, 2.5)\) for coefficients, regularize estimation and reduce variance in small samples (Gelman et al. 2008; van de Schoot et al. 2021). Tutorials using R packages like brms and blavaan illustrate how MCMC enables posterior inference and empirical Bayes analysis (Klauenberg et al. 2015).

Beyond standard generalized linear models, Bayesian nonparametric regression flexibly captures nonlinearity and zero inflation common in health data (Richardson and Hartman 2018). Bayesian Additive Regression Trees (BART) improve variable selection in mixed-type data (Luo et al. 2024), while state-space and dynamic Bayesian models incorporate time-varying biomarkers for longitudinal prediction (Momeni et al. 2021). Bayesian model averaging (BMA) further addresses model uncertainty by weighting across multiple specifications (Hoeting et al. 1999). Together, these approaches demonstrate the versatility and growing importance of Bayesian inference in clinical and epidemiologic modeling.

The objective of this project is to evaluate whether Bayesian inference provides more stable and interpretable estimates of diabetes risk than frequentist and imputation-based approaches, particularly when data complexity or separation challenges arise.

The analytical framework for this study is grounded in Bayesian logistic regression, providing a probabilistic approach for estimating diabetes risk and quantifying uncertainty in population health modeling.

Method

Bayesian Logistic Regression

This study employs Bayesian logistic regression to estimate the association between predictors and a binary outcome.

The Bayesian framework integrates prior knowledge with observed data to generate posterior distributions, allowing parameters to be interpreted directly in probabilistic terms.

Unlike traditional frequentist approaches that yield single-point estimates and p-values, Bayesian methods represent parameters as random variables with full probability distributions.

This provides greater flexibility, incorporates parameter uncertainty, and produces credible intervals that directly quantify the probability that a parameter lies within a given range.

Model Structure

Bayesian logistic regression models the log-odds of a binary outcome as a linear combination of predictors:

\[ \text{logit}(P(Y = 1)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k \]

where

  • \(P(Y = 1)\) is the probability of the event of interest,
  • \(\beta_0\) is the intercept (log-odds when all predictors are zero), and
  • \(\beta_j\) represents the effect of predictor \(X_j\) on the log-odds of the outcome, holding other predictors constant.
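As a worked step, inverting the logit gives the outcome probability, and exponentiating a coefficient gives an odds ratio. The symbol \(\eta\) for the linear predictor is introduced here only for illustration:

\[ \eta = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k, \qquad P(Y = 1) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}, \qquad \text{OR}_j = e^{\beta_j} \]

For example, \(\eta = 0\) corresponds to a probability of 0.5, and a coefficient of \(\beta_j = \log 2 \approx 0.69\) doubles the odds for each one-unit increase in \(X_j\).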

In the Bayesian framework, model parameters (\(\boldsymbol{\beta}\)) are treated as random variables and assigned prior distributions that reflect existing knowledge or plausible ranges before observing the data. After incorporating the observed evidence, the priors are updated through Bayes’ theorem (Leeuw and Klugkist 2012; Klauenberg et al. 2015):

\[ \text{Posterior} \propto \text{Likelihood} \times \text{Prior} \]

  • Likelihood: represents the probability of the observed data given the model parameters—it captures how well different parameter values explain the data.
  • Prior: expresses beliefs or existing information about the parameters before observing the data.
  • Posterior: combines both, representing the updated distribution of parameter values after accounting for the data.
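The update rule above can be illustrated with a grid approximation in base R. This is a minimal sketch on simulated data, not the NHANES analysis: the sample size, the true slope of 0.8, and the fixed intercept of -1 are all assumptions chosen for illustration.

```r
# Grid approximation of the posterior for one logistic-regression slope.
# All data are simulated; nothing here is taken from NHANES.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))  # hypothetical true slope = 0.8

beta_grid <- seq(-3, 3, length.out = 601)

# Log-likelihood of the data at each candidate slope
# (the intercept is held fixed at -1 to keep the example one-dimensional)
log_lik <- sapply(beta_grid, function(b) {
  sum(dbinom(y, size = 1, prob = plogis(-1 + b * x), log = TRUE))
})

# Log-prior: Normal(0, sd = 2.5)
log_prior <- dnorm(beta_grid, mean = 0, sd = 2.5, log = TRUE)

# Posterior is proportional to likelihood x prior; normalize over the grid
log_post <- log_lik + log_prior
post <- exp(log_post - max(log_post))
post <- post / sum(post)

post_mean <- sum(beta_grid * post)
post_mean
```

With 200 observations the likelihood is much more concentrated than the prior, so the posterior is dominated by the data. MCMC replaces the grid when there are several parameters.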

This formulation allows uncertainty to propagate naturally through the model, producing a posterior distribution for each coefficient from which direct probability statements can be made (for example, the probability that a coefficient is positive).

Prior Specification

Weakly informative priors were used to regularize estimation without imposing strong assumptions:

  • Regression coefficients: \(N(0, 2.5)\), providing gentle regularization while allowing substantial variation in plausible effects (Gelman et al. 2008; van de Schoot et al. 2021).
  • Intercept: Student’s t-distribution prior, \(t(3, 0, 10)\) (Schoot et al. 2013; van de Schoot et al. 2021), which has
    • 3 degrees of freedom (heavy tails that allow occasional extreme values),
    • location 0 (no bias toward positive or negative values), and
    • scale 10 (broad range of possible values).

Such priors help stabilize estimation in the presence of multicollinearity, limited sample size, or potential outliers.
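To see how weak the coefficient prior is, it helps to translate it to the odds-ratio scale. The sketch below assumes that 2.5 is a standard deviation (the parameterization used by Stan and brms for normal priors) rather than a variance.

```r
# Central 95% range implied by a Normal(0, sd = 2.5) prior on a log-odds coefficient
prior_sd <- 2.5  # assumption: 2.5 is a standard deviation, not a variance

log_or_range <- qnorm(c(0.025, 0.975), mean = 0, sd = prior_sd)
or_range <- exp(log_or_range)

round(log_or_range, 2)  # about -4.90 and 4.90
signif(or_range, 3)     # about 0.00745 and 134
```

Under this reading, the prior allows odds ratios per 1 SD from below 0.01 to above 100, far wider than the effects typically seen for standardized predictors, so it mainly guards against implausibly extreme estimates.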

Advantages of Bayesian Logistic Regression

  • Uncertainty quantification: Produces full posterior distributions instead of single estimates.
  • Credible intervals: Provide the range within which a parameter lies with a specified probability (e.g., 95%).
  • Flexible priors: Allow integration of expert knowledge or findings from prior studies.
  • Probabilistic predictions: Posterior predictive distributions yield direct probabilities for new or future observations.
  • Model evaluation: Posterior predictive checks (PPCs) assess how well simulated outcomes reproduce observed data.

Posterior Predictions

Posterior distributions of regression coefficients were used to estimate the probability of the outcome for given predictor values. This allows statements such as: “Given the predictors, the probability of the outcome lies between X% and Y%.”

Posterior predictions account for two key sources of uncertainty:

  1. Parameter uncertainty: Variability in estimated model coefficients.
  2. Predictive uncertainty: Variability in possible future outcomes given those parameters.
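The two sources of uncertainty can be separated in a few lines of base R. The draws below are simulated from normal distributions with made-up means and standard deviations; they stand in for posterior draws and are not output from the fitted model.

```r
# Separating parameter uncertainty from predictive uncertainty (simulated draws)
set.seed(2)
n_draws <- 4000

# Hypothetical posterior draws for an intercept and one slope (not fitted values)
b0 <- rnorm(n_draws, mean = -2.8, sd = 0.08)
b1 <- rnorm(n_draws, mean = 1.0, sd = 0.06)

x_new <- 1  # a predictor value one SD above the mean

# 1. Parameter uncertainty: one predicted probability per posterior draw
p_draws <- plogis(b0 + b1 * x_new)
quantile(p_draws, c(0.025, 0.5, 0.975))

# 2. Predictive uncertainty: one simulated 0/1 outcome per draw
y_rep <- rbinom(n_draws, size = 1, prob = p_draws)
mean(y_rep)  # close to mean(p_draws), although each y_rep is 0 or 1
```

The quantiles of `p_draws` give a credible interval for the probability itself, while `y_rep` shows the additional binary noise in any single future observation.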

In Bayesian analysis, all unknown quantities—coefficients, means, variances, or probabilities—are treated as random variables described by their posterior distributions.

Model Evaluation and Diagnostics

Model quality and convergence were assessed using standard Bayesian diagnostics:

  • Posterior sampling: Conducted via Markov Chain Monte Carlo (MCMC) using the No-U-Turn Sampler (NUTS), a variant of Hamiltonian Monte Carlo (HMC) (Austin et al. 2021). Four chains were run with sufficient warm-up iterations to ensure convergence.
  • Convergence metrics: The potential scale reduction factor (\(\hat{R}\)) and effective sample size (ESS) were used to verify stability and mixing across chains.
  • Autocorrelation checks: Assessed the dependence between successive draws; low autocorrelation indicates efficient sampling.
  • Posterior predictive checks (PPCs): Compared simulated outcomes to observed data to evaluate fit.
  • Bayesian \(R^2\): Quantified the proportion of variance explained by predictors, incorporating posterior uncertainty.
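The idea behind \(\hat{R}\) can be sketched with the basic between-chain versus within-chain variance comparison. This is a simplified version written for illustration; current Stan-based software reports a rank-normalized split-chain variant, and the chains below are simulated rather than MCMC output.

```r
# Basic potential scale reduction factor (R-hat) for one parameter.
# `draws` is an iterations x chains matrix.
rhat_basic <- function(draws) {
  n <- nrow(draws)
  chain_means <- colMeans(draws)
  chain_vars <- apply(draws, 2, var)
  W <- mean(chain_vars)        # within-chain variance
  B <- n * var(chain_means)    # between-chain variance
  var_plus <- (n - 1) / n * W + B / n
  sqrt(var_plus / W)
}

set.seed(3)
good <- matrix(rnorm(4000), nrow = 1000, ncol = 4)  # four chains, same target
bad <- good
bad[, 1] <- bad[, 1] + 3                             # one chain stuck elsewhere

rhat_basic(good)  # near 1
rhat_basic(bad)   # well above 1
```

Values near 1 mean the chains agree with each other; a value well above 1 signals that at least one chain is exploring a different region.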

Analysis and Results

Data Preparation

We analyzed NHANES 2013–2014 public-use data from the CDC’s National Center for Health Statistics (National Center for Health Statistics (NCHS) 2014). Three component files were merged: demographics (DEMO_H), body measures (BMX_H), and the diabetes questionnaire (DIQ_H). All variables were coerced to consistent numeric or factor types prior to merging to ensure atomic columns suitable for survey analysis and modeling.

Import and Merge Datasets

Preview of merged NHANES 2013–2014 dataset limited to analysis variables (source columns only).
RIDAGEYR BMXBMI RIAGENDR RIDRETH1 DIQ010
69 26.7 1 4 1
54 28.6 1 3 1
72 28.9 1 3 1
9 17.1 1 3 2
73 19.7 2 3 2
56 41.7 1 1 2
0 NA 1 3 NA
61 35.7 2 3 2
42 NA 1 2 2
56 26.5 2 3 2

This preview shows the raw NHANES columns before transformation. Each variable is retained for later use in analysis and renamed or standardized as appropriate.

The merged dataset contains 10,175 participants. It integrates demographic, examination, and diabetes questionnaire data. We then restrict the sample to adults (age ≥ 20) to define the analytic cohort used in subsequent analyses. A small proportion of records have missing values in BMI or diabetes status; missing BMI is addressed later through multiple imputation, while diabetes status, as the outcome, is not imputed.

Adult Cohort Definition

Table 1: Variable Descriptions: Adult Analytic Dataset (NHANES 2013–2014)
Variable NHANES_Source Description Type
age RIDAGEYR Participant age in years (adults aged 20 years and older) Continuous
bmi BMXBMI Body Mass Index (BMI, kg/m²) measured during examination Continuous
sex RIAGENDR Sex of participant (Male or Female) Categorical
race3 RIDRETH1 Race/ethnicity collapsed into White, Black, Hispanic, and Other/Multi categories Categorical
diabetes_ind DIQ010 Doctor-diagnosed diabetes indicator (1 = Yes, 0 = No) Binary
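The recoding described in Table 1 can be sketched in base R using the ten preview rows shown earlier. The code mappings follow the NHANES conventions implied by the preview and the derived data (RIAGENDR 1 = Male, 2 = Female; RIDRETH1 1 and 2 = Hispanic, 3 = White, 4 = Black, 5 = Other). How the authors handled DIQ010 responses other than 1 and 2 is not shown, so setting them to missing is an assumption.

```r
# Sketch of the adult-cohort recoding, applied to the ten preview rows
raw <- data.frame(
  RIDAGEYR = c(69, 54, 72, 9, 73, 56, 0, 61, 42, 56),
  BMXBMI   = c(26.7, 28.6, 28.9, 17.1, 19.7, 41.7, NA, 35.7, NA, 26.5),
  RIAGENDR = c(1, 1, 1, 1, 2, 1, 1, 2, 1, 2),
  RIDRETH1 = c(4, 3, 3, 3, 3, 1, 3, 3, 2, 3),
  DIQ010   = c(1, 1, 1, 2, 2, 2, NA, 2, 2, 2)
)

# Restrict to adults aged 20 and older
adult <- subset(raw, RIDAGEYR >= 20)

adult$age <- adult$RIDAGEYR
adult$bmi <- adult$BMXBMI
adult$sex <- factor(adult$RIAGENDR, levels = c(2, 1), labels = c("Female", "Male"))

# Collapse the five RIDRETH1 levels into four groups
race_map <- c("1" = "Hispanic", "2" = "Hispanic", "3" = "White",
              "4" = "Black", "5" = "Other")
adult$race3 <- factor(race_map[as.character(adult$RIDRETH1)],
                      levels = c("White", "Hispanic", "Black", "Other"))

# Assumption: only 1 (Yes) and 2 (No) are kept; any other code becomes NA
adult$diabetes_ind <- ifelse(adult$DIQ010 == 1, 1,
                             ifelse(adult$DIQ010 == 2, 0, NA))

adult[, c("age", "bmi", "sex", "race3", "diabetes_ind")]
```

Eight of the ten preview rows remain after the age restriction.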

The summaries below confirm that only BMI and the diabetes indicator contain missing values, supporting the use of multiple imputation for BMI while the other predictors remain complete.

# Show structure and first few participants

str(adult)
'data.frame':   5769 obs. of  11 variables:
 $ SDMVPSU     : num  1 1 1 2 1 1 2 1 2 2 ...
 $ SDMVSTRA    : num  112 108 109 116 111 114 106 112 112 113 ...
 $ WTMEC2YR    : num  13481 24472 57193 65542 25345 ...
 $ diabetes_ind: num  1 1 1 0 0 0 0 0 0 0 ...
 $ bmi         : num  26.7 28.6 28.9 19.7 41.7 35.7 NA 26.5 22 20.3 ...
 $ age         : num  69 54 72 73 56 61 42 56 65 26 ...
 $ sex         : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 2 1 2 1 ...
 $ race        : Factor w/ 5 levels "Mexican American",..: 2 3 3 3 1 3 4 3 3 3 ...
 $ age_c       : num [1:5769, 1] 1.132 0.278 1.303 1.36 0.392 ...
  ..- attr(*, "scaled:center")= num 49.1
  ..- attr(*, "scaled:scale")= num 17.6
 $ bmi_c       : num [1:5769, 1] -0.3359 -0.0703 -0.0283 -1.3144 1.761 ...
  ..- attr(*, "scaled:center")= num 29.1
  ..- attr(*, "scaled:scale")= num 7.15
 $ race3       : Factor w/ 4 levels "White","Hispanic",..: 3 1 1 1 2 1 2 1 1 1 ...
head(adult, 10)
   SDMVPSU SDMVSTRA WTMEC2YR diabetes_ind  bmi age    sex             race
1        1      112 13481.04            1 26.7  69   Male         NH Black
2        1      108 24471.77            1 28.6  54   Male         NH White
3        1      109 57193.29            1 28.9  72   Male         NH White
4        2      116 65541.87            0 19.7  73 Female         NH White
5        1      111 25344.99            0 41.7  56   Male Mexican American
6        1      114 61758.65            0 35.7  61 Female         NH White
7        2      106     0.00            0   NA  42   Male   Other Hispanic
8        1      112 17480.12            0 26.5  56 Female         NH White
9        2      112 34795.43            0 22.0  65   Male         NH White
10       2      113 91523.52            0 20.3  26 Female         NH White
        age_c       bmi_c    race3
1   1.1324183 -0.33588609    Black
2   0.2783598 -0.07028101    White
3   1.3032300 -0.02834336    White
4   1.3601672 -1.31443114    White
5   0.3922343  1.76099614 Hispanic
6   0.6769204  0.92224325    White
7  -0.4048870          NA Hispanic
8   0.3922343 -0.36384452    White
9   0.9046694 -0.99290919    White
10 -1.3158827 -1.23055585    White

Descriptive statistics for continuous and categorical variables are presented below.

Table 1a. Continuous variables (age, BMI): N, missing, mean (SD), range.
Variable N Missing Mean SD Min Max
age 5769 0 49.11 17.56 20.0 80.0
bmi 5520 249 29.10 7.15 14.1 82.9

Table 1b. Categorical variables (sex, race3, diabetes_ind): counts and percentages.
Variable Level n pct
diabetes_ind No 4870 84.4
diabetes_ind Yes 722 12.5
diabetes_ind (Missing) 177 3.1
race3 White 2472 42.8
race3 Hispanic 1275 22.1
race3 Black 1177 20.4
race3 Other 845 14.6
sex Female 3011 52.2
sex Male 2758 47.8

Tables 1a and 1b summarize the analytic variables included in subsequent models. The continuous summaries indicate an adult cohort spanning a wide range of ages and BMI values, while the categorical summaries confirm balanced sex representation and sufficient sample sizes across race/ethnicity categories. Age and BMI were standardized, and all four variables were used as predictors in every modeling framework.

Table 2: Excerpt of the NHANES 2013–2014 adult cohort (age ≥ 20; N = 5,769) with derived and standardized variables.
SDMVPSU SDMVSTRA WTMEC2YR diabetes_ind bmi age sex race age_c bmi_c race3
1 112 13481.04 1 26.7 69 Male NH Black 1.1324183 -0.33588609 Black
1 108 24471.77 1 28.6 54 Male NH White 0.2783598 -0.07028101 White
1 109 57193.29 1 28.9 72 Male NH White 1.3032300 -0.02834336 White
2 116 65541.87 0 19.7 73 Female NH White 1.3601672 -1.31443114 White
1 111 25344.99 0 41.7 56 Male Mexican American 0.3922343 1.76099614 Hispanic
1 114 61758.65 0 35.7 61 Female NH White 0.6769204 0.92224325 White

As shown in Table 2, the analytic adult cohort (N = 5,769) includes standardized variables for age and BMI (age_c, bmi_c), categorical indicators for sex and race/ethnicity (race3), and a binary doctor-diagnosed diabetes variable (diabetes_ind).
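The standardized values can be reproduced by hand from the means and standard deviations in Table 1a. The sketch below uses those rounded summary values, so results agree with age_c and bmi_c only to about three decimal places.

```r
# Standardizing with the center and scale reported in Table 1a
age_center <- 49.11; age_scale <- 17.56
bmi_center <- 29.10; bmi_scale <- 7.15

# First three participants in Table 2
age <- c(69, 54, 72)
bmi <- c(26.7, 28.6, 28.9)

age_c <- (age - age_center) / age_scale
bmi_c <- (bmi - bmi_center) / bmi_scale

round(age_c, 2)  # 1.13 0.28 1.30
round(bmi_c, 2)  # -0.34 -0.07 -0.03

# scale() performs the same calculation but returns a one-column matrix;
# as.numeric() drops the matrix structure and attributes
z <- as.numeric(scale(age))
```

A value of age_c = 1.13 therefore means an age about 1.13 standard deviations (roughly 20 years) above the cohort mean.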

Missing Data Summary

Figure 1: Missing data pattern for analytic variables (outcome and predictors only).

The missingness plot visually confirms that BMI and diabetes status have modest proportions of missing data, with no evident systematic pattern across records.

Exploratory Data Summary

The adult analytic cohort was broadly representative of the U.S. population, with Non-Hispanic White participants forming the largest group (42.8%). Age and BMI distributions were right-skewed, with most participants classified as overweight or obese. Visual exploration revealed clear positive relationships of age and BMI with diabetes prevalence. Non-Hispanic Black and Hispanic participants exhibited higher proportions of diabetes than Non-Hispanic White participants. Missingness was minimal and limited to BMI and diabetes status, supporting the use of multiple imputation for BMI.

Figure 2: Age distribution (age ≥ 20).
Figure 3: BMI distribution.
Figure 4: Sex composition.
Figure 5: Race/ethnicity composition (race3).

The EDA missingness summary shows approximately 4.3% missing BMI (249 of 5,769) and 3.1% missing diabetes status (177 of 5,769; diabetes_ind). All design variables (WTMEC2YR, SDMVPSU, SDMVSTRA), as well as age, sex, and race3, are complete. In the EDA view, missing values of categorical variables are shown as an explicit “(Missing)” level.

Modeling Frameworks

Three modeling frameworks were compared using identical predictors (standardized age, BMI, sex, and race3) and the binary outcome diabetes_ind: (1) survey-weighted logistic regression to incorporate the NHANES complex sampling design, (2) multiple imputation (MICE) to address missing BMI values, and (3) Bayesian logistic regression with weakly informative priors to quantify uncertainty.

Survey-Weighted Logistic Regression (Design-Based MLE)

'data.frame':   5349 obs. of  5 variables:
 $ diabetes_ind: num  1 1 1 0 0 0 0 0 0 1 ...
 $ sex         : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 1 2 1 2 ...
 $ race3       : Factor w/ 4 levels "White","Hispanic",..: 3 1 1 1 2 1 1 1 1 1 ...
 $ age_c       : num  1.132 0.278 1.303 1.36 0.392 ...
 $ bmi_c       : num  -0.3359 -0.0703 -0.0283 -1.3144 1.761 ...

Design-based odds ratios are summarized in Table 3.

Table 3: Survey-weighted logistic regression: odds ratios (OR) and 95% confidence intervals for diabetes diagnosis among adults (NHANES 2013–2014).
term OR LCL UCL p.value
age_c 2.977668 2.704677 3.278212 0.0000000
bmi_c 1.930284 1.679190 2.218924 0.0000021
sexMale 1.287236 1.018664 1.626618 0.0373081
race3Hispanic 1.809943 1.428957 2.292507 0.0003024
race3Black 1.599844 1.157393 2.211434 0.0094754
race3Other 2.195356 1.461219 3.298334 0.0017976

The NHANES 2013–2014 data use a complex, multistage probability design involving strata (SDMVSTRA), primary sampling units (PSUs; SDMVPSU), and examination weights (WTMEC2YR) to ensure nationally representative estimates (National Center for Health Statistics (NCHS) 2014).

Estimates are population-weighted using NHANES survey design variables (WTMEC2YR, SDMVSTRA, SDMVPSU). Odds ratios are reported per one standard deviation (1 SD) increase in age and BMI, with reference groups Female and White.
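Per-SD odds ratios can be re-expressed on more familiar scales. The sketch below takes the odds ratios from Table 3 and the standard deviations from Table 1a, and assumes the log-odds are linear in each predictor, as the model does. The helper name `rescale_or` is introduced here for illustration.

```r
# Re-expressing per-SD odds ratios per 10 years of age and per 5 kg/m^2 of BMI
or_age_sd <- 2.977668; sd_age <- 17.56
or_bmi_sd <- 1.930284; sd_bmi <- 7.15

# An OR per k units equals exp(log(OR per SD) * k / SD)
rescale_or <- function(or_per_sd, sd, k) exp(log(or_per_sd) * k / sd)

or_age_10 <- rescale_or(or_age_sd, sd_age, 10)  # per 10 years of age
or_bmi_5  <- rescale_or(or_bmi_sd, sd_bmi, 5)   # per 5 kg/m^2 of BMI

round(c(or_age_10, or_bmi_5), 2)  # 1.86 1.58
```

On this scale, each additional 10 years of age is associated with roughly 1.9 times the odds of diagnosed diabetes, and each additional 5 BMI units with roughly 1.6 times the odds.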

Multiple Imputation (MICE)

Multiple Imputation by Chained Equations (MICE) was used as a principled approach for handling missing data (van Buuren and Groothuis-Oudshoorn 2011; van Buuren 2012).
MICE iteratively imputes each incomplete variable using regression models based on other variables in the dataset, producing multiple completed datasets that reflect uncertainty due to missingness. Estimates are then pooled across imputations using Rubin’s rules to generate final parameter estimates and confidence intervals.
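Rubin's rules combine the within-imputation and between-imputation variance into a single standard error. The sketch below implements the standard formulas for one coefficient; the three estimates and variances are made-up numbers, and `pool_rubin` is a hypothetical helper, not a function from the mice package.

```r
# Rubin's rules for one coefficient across m imputed datasets
pool_rubin <- function(est, var_within) {
  m <- length(est)
  q_bar <- mean(est)                 # pooled point estimate
  w_bar <- mean(var_within)          # average within-imputation variance
  b <- var(est)                      # between-imputation variance
  t_var <- w_bar + (1 + 1 / m) * b   # total variance
  c(estimate = q_bar, se = sqrt(t_var))
}

est <- c(1.0, 1.2, 1.1)                # illustrative log-odds estimates, m = 3
var_within <- c(0.040, 0.050, 0.045)   # illustrative squared standard errors

pooled <- pool_rubin(est, var_within)
pooled
exp(pooled[["estimate"]])              # pooled odds ratio
```

Pooling is done on the log-odds scale and exponentiated afterwards; the between-imputation term is what carries the extra uncertainty due to missing data.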

MICE, as an alternative to the Bayesian approach, effectively manages missing data through chained regression equations without requiring full joint modeling of all variables.

For large sample sizes (n ≥ 400), even in the presence of high percentages (up to 75%) of missing data in one variable, non-normal distributions such as flat densities, heavy tails, skewness, and multimodality do not materially affect mean structure estimation performance (van Buuren 2012).

In this study, the imputation model specified predictive mean matching (PMM) for continuous variables (age and BMI) to preserve realistic distributions, and logistic and polytomous regression for categorical variables (sex and race3, respectively). Because age, sex, and race3 had no missing values, only BMI was imputed in practice. Diabetes status (diabetes_ind) was treated as the outcome variable and was not imputed. Twenty imputations were generated to reduce Monte Carlo error and maintain robust variance estimation.

Multiple imputation preserves sample size and reduces bias from missing BMI values. Results closely mirror the survey-weighted model, confirming robustness to imputation.

Table 5: Multiple Imputation (MICE): pooled odds ratios (OR) and 95% confidence intervals after imputing missing BMI (PMM) (m = 20); diabetes status was not imputed. Odds ratios are per 1 SD for age and BMI. The standard error and test statistic are on the log-odds scale.
term OR std.error statistic df p.value LCL UCL
Age (per 1 SD) 2.8956646 0.0524433 20.273592 5499.574 0.0000000 2.6127544 3.2092084
BMI (per 1 SD) 1.8053391 0.0430294 13.728961 3877.226 0.0000000 1.6592839 1.9642506
Female (vs. Male) 0.8056102 0.0872427 -2.477631 5545.705 0.0132553 0.6789653 0.9558776
Hispanic (vs. White) 2.0741944 0.1128115 6.467183 5562.036 0.0000000 1.6626591 2.5875915
Black (vs. White) 1.7931172 0.1137153 5.135240 5508.172 0.0000003 1.4348045 2.2409110
Other (vs. White) 2.0011166 0.1443011 4.807344 5464.535 0.0000016 1.5080503 2.6553940

Bayesian Logistic Regression

Bayesian logistic regression was implemented using the following model specification:

Formula:
diabetes_ind | weights(wt_norm) ~ age_c + bmi_c + sex + race3

Running MCMC with 4 sequential chains...

Chain 1 finished in 10.6 seconds.
Chain 2 finished in 9.9 seconds.
Chain 3 finished in 10.6 seconds.
Chain 4 finished in 10.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 10.4 seconds.
Total execution time: 42.0 seconds.

Posterior odds ratios and credible intervals from the Bayesian logistic regression are shown in Table 4.

Table 4: Bayesian logistic regression: posterior odds ratios (OR) with 95% credible intervals.
term OR LCL UCL
Intercept 0.06 0.05 0.07
age_c 2.93 2.62 3.30
bmi_c 1.92 1.76 2.10
sexMale 1.28 1.06 1.55
race3Hispanic 1.82 1.40 2.38
race3Black 1.62 1.23 2.12
race3Other 2.11 1.46 2.99

As shown in Table 4, the Bayesian logistic regression model estimated the log-odds of diabetes using standardized predictors. Weakly informative priors (\(N(0, 2.5)\) for slopes, Student-t(3, 0, 10) for the intercept) stabilized estimation and prevented overfitting. The model used normalized NHANES exam weights as importance weights to approximate design-based inference. Posterior means and 95% credible intervals provided full uncertainty quantification for each predictor.
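Weight normalization can be sketched as follows. The exact rule used to build wt_norm is not shown in this report, so dividing by the mean weight (which makes the weights average 1 and sum to the sample size) is an assumption; it is one common choice. The fitted probabilities are hypothetical.

```r
# Normalizing survey weights so they average 1 (and therefore sum to n)
w <- c(13481.04, 24471.77, 57193.29, 65541.87, 25344.99)  # WTMEC2YR values from Table 2

wt_norm <- w / mean(w)

mean(wt_norm)  # 1
sum(wt_norm)   # 5, the number of rows

# As importance weights, each row's log-likelihood contribution
# is multiplied by its normalized weight
p_hat <- c(0.30, 0.15, 0.35, 0.10, 0.20)  # hypothetical fitted probabilities
y <- c(1, 1, 1, 0, 0)
weighted_loglik <- sum(wt_norm * dbinom(y, size = 1, prob = p_hat, log = TRUE))
weighted_loglik
```

Normalizing keeps the effective amount of information close to the actual sample size; using the raw weights, which are in the tens of thousands, would make the posterior far too narrow. The weights still do not account for strata or clustering.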

Posterior summaries were further evaluated using the Bayesian \(R^2\), which estimates the proportion of outcome variance explained by model predictors.

Model-level performance is summarized in Table 6.

Table 6: Bayesian R² summary.
Estimate Est.Error Q2.5 Q97.5
R2 0.1380082 0.0118579 0.115269 0.1616909

Table 7: MCMC diagnostics (R-hat and Effective Sample Sizes) for model parameters (including intercept).
Parameter Rhat Bulk_ESS Tail_ESS
b_Intercept 1 2721.2 2817.8
b_age_c 1 2605.7 2704.7
b_bmi_c 1 3229.5 2783.0
b_sexMale 1 3613.7 3128.7
b_race3Hispanic 1 3805.4 3109.7
b_race3Black 1 3685.6 3133.0
b_race3Other 1 3513.0 2637.5

All model parameters achieved R̂ values approximately equal to 1.00 and bulk and tail effective sample sizes exceeding 2,000, indicating good convergence and well-mixed chains. The Bayesian R² was approximately 0.14 (95% CrI 0.12–0.16), indicating that age, BMI, sex, and race collectively explained about 14% of the variability in diabetes risk at the population level.

Model comparison results using leave-one-out cross-validation are presented below.

Bayesian model comparison (LOO): base model vs. models without race or without sex.
Model elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
bayes_fit bayes_fit 0.000000 0.000000 -1574.573 56.98453 8.291612 0.5425672 3149.146 113.9691
fit_no_sex fit_no_sex -2.126325 3.296627 -1576.699 57.10511 6.569758 0.4419557 3153.399 114.2102
fit_no_race fit_no_race -14.748276 6.302564 -1589.321 54.52053 5.467927 0.3638458 3178.643 109.0411

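Leave-one-out (LOO) cross-validation showed that models excluding race or sex had lower expected log predictive density (elpd) than the full model. Dropping race reduced elpd by 14.7 (SE 6.3), more than two standard errors, which supports race/ethnicity as a meaningful contributor to predictive performance. Dropping sex reduced elpd by only 2.1 (SE 3.3), a difference within one standard error, so the predictive contribution of sex is small and uncertain.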
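A quick way to read the comparison table is to divide each elpd difference by its standard error. The sketch below uses the values reported above; treating a ratio beyond about 2 in absolute value as a clear difference is a rule of thumb, not a formal test.

```r
# Putting the LOO differences on a standard-error scale
loo_tab <- data.frame(
  model     = c("fit_no_sex", "fit_no_race"),
  elpd_diff = c(-2.126325, -14.748276),
  se_diff   = c(3.296627, 6.302564)
)

loo_tab$ratio <- loo_tab$elpd_diff / loo_tab$se_diff
round(loo_tab$ratio, 1)  # -0.6 -2.3

# looic is -2 * elpd_loo, for example for the base model:
-2 * -1574.573  # 3149.146
```

The ratio for the model without race is beyond -2, while the ratio for the model without sex is well inside one standard error.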

Figures below visualize posterior distributions, MCMC diagnostics, and model fit.

Figure 6: Posterior distributions (95% credible mass) for slope parameters.
Figure 7: Trace plots for slope parameters (chain mixing and stationarity).
Figure 8: Autocorrelation plots for posterior samples of age and BMI coefficients (MCMC diagnostics).
Figure 9: Posterior density areas (95% credible mass) for age, BMI, sex, and race coefficients.
Figure 10: Posterior predictive check: observed vs. replicated outcome distribution (bars).
Figure 11: Posterior predictive checks for mean and standard deviation of the binary outcome.
Figure 12: Posterior predictive checks for mean and standard deviation of the binary outcome.

Model Fit and Calibration

Posterior predictive checks showed that simulated outcome distributions closely matched the observed diabetes prevalence, indicating strong model calibration. Both the mean and standard deviation of replicated outcomes aligned with observed data, suggesting the model adequately captured central tendency and dispersion. These results provide graphical evidence of good fit and reinforce that the priors did not unduly constrain the posterior.

Figure 13: Posterior odds ratios (points) with 95% credible intervals (lines).

Calibration between predicted and observed diabetes probabilities is displayed in Figure 14.

Figure 14: Observed outcome vs. mean predicted probability (calibration scatter with smoother).
Figure 15: Posterior predictive distribution of diabetes prevalence compared to observed NHANES prevalence.
Figure 16: Posterior predictive distribution of diabetes prevalence compared to observed NHANES prevalence.

The posterior predictive distribution of diabetes prevalence closely mirrored the survey-estimated prevalence, with the posterior mean aligning within 1% of the observed rate. This indicates that the Bayesian model accurately reproduced the population-level prevalence and supports its calibration for epidemiologic inference.
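The calibration and prevalence comparisons above can be mimicked with a binned check in base R. The data below are simulated so that the predictions are well calibrated by construction; the predicted probabilities are hypothetical and are not the model's output.

```r
# Binned calibration check on simulated data
set.seed(4)
n <- 5000
p_hat <- runif(n, min = 0.01, max = 0.60)  # hypothetical predicted probabilities
y <- rbinom(n, size = 1, prob = p_hat)     # outcomes drawn from those probabilities

# Ten equal-count bins of predicted probability
breaks <- quantile(p_hat, probs = seq(0, 1, by = 0.1))
bin <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

calib <- data.frame(
  mean_predicted = as.numeric(tapply(p_hat, bin, mean)),
  observed_rate  = as.numeric(tapply(y, bin, mean))
)
calib$gap <- calib$observed_rate - calib$mean_predicted
round(calib, 3)

# Overall prevalence: predicted vs. observed
c(predicted = mean(p_hat), observed = mean(y))
```

For a calibrated model the gaps are small in every bin, not only on average; agreement in overall prevalence alone is a weaker check.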

Figure 17: Population (NHANES survey-weighted) vs posterior predictive diabetes prevalence.
Figure 18: Prior (dashed) vs posterior (solid) densities for selected coefficients.

Figure 19: Prior vs Posterior Distributions (ggplot2 version).
Figure 20

For age and BMI, the posterior densities shift notably away from the \(N(0, 2.5)\) prior toward positive values and are much narrower, indicating strong information from the data. For sex, the posterior sits closer to zero and therefore overlaps more with the center of the prior, indicating a smaller estimated effect.

The overlay of prior and posterior densities illustrates that the largest shifts occurred for the BMI, age, and race coefficients. All posteriors are far narrower than the prior, so inference is driven mainly by the data even for weaker predictors such as sex, with the prior acting as mild regularization.

Results

A concise summary of posterior estimates is provided below.

Population-level interpretation (posterior, odds ratios)

  • Convergence. All R-hat ≈ 1.00; large ESS → excellent mixing.
  • Baseline risk. Female, White, mean age/BMI: ~5.2% predicted diabetes prevalence.
  • Age. +1 SD → 2.93× (95% CrI 2.62–3.30; CrI excludes 1).
  • BMI. +1 SD → 1.92× (95% CrI 1.76–2.10; CrI excludes 1).
  • Male vs. Female. 1.28× (95% CrI 1.06–1.55; CrI excludes 1).
  • Black vs. White. 1.62× (95% CrI 1.23–2.12; CrI excludes 1).
  • Hispanic vs. White. 1.82× (95% CrI 1.40–2.38; CrI excludes 1).
  • Other/Multi vs. White. 2.11× (95% CrI 1.46–2.99; CrI excludes 1).
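The odds ratios in this summary can be turned into predicted probabilities by multiplying odds and converting back. The sketch below uses the rounded values from Table 4, so it gives a baseline of about 5.7% rather than the ~5.2% reported above, which presumably reflects the unrounded intercept (odds of about 0.055 correspond to roughly 5.2%). The reference profile is all predictors at zero: the reference levels for sex and race3 (Female and White, given the sexMale and race3 terms) at mean age and BMI.

```r
# From posterior odds ratios (Table 4, rounded) to predicted probabilities
or <- c(Intercept = 0.06, age_c = 2.93, bmi_c = 1.92,
        sexMale = 1.28, race3Black = 1.62)

odds_to_prob <- function(odds) odds / (1 + odds)

# Reference profile: all predictors at zero
p_ref <- odds_to_prob(or[["Intercept"]])

# Illustrative profile: male, Black, age and BMI each 1 SD above the mean
odds_profile <- prod(or)
p_profile <- odds_to_prob(odds_profile)

round(c(reference = p_ref, profile = p_profile), 3)  # 0.057 0.412
```

Odds ratios multiply, but probabilities do not: the profile's odds are about 11.7 times the baseline odds, while its probability is about 7 times the baseline probability.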

Comparative odds ratios across frameworks are shown in Table 8.

Table 8: Comparison of odds ratios (per 1 SD for age and BMI) and 95% intervals across survey-weighted, MICE, and Bayesian frameworks.
Model term OR_CI
Survey-weighted (MLE) Age (per 1 SD) 2.98 (2.70 – 3.28)
Survey-weighted (MLE) BMI (per 1 SD) 1.93 (1.68 – 2.22)
Survey-weighted (MLE) Male (vs. Female) 1.29 (1.02 – 1.63)
Survey-weighted (MLE) Hispanic (vs. White) 1.81 (1.43 – 2.29)
Survey-weighted (MLE) Black (vs. White) 1.60 (1.16 – 2.21)
Survey-weighted (MLE) Other (vs. White) 2.20 (1.46 – 3.30)
MICE (pooled) Age (per 1 SD) 2.90 (2.61 – 3.21)
MICE (pooled) BMI (per 1 SD) 1.81 (1.66 – 1.96)
MICE (pooled) Female (vs. Male) 0.81 (0.68 – 0.96)
MICE (pooled) Hispanic (vs. White) 2.07 (1.66 – 2.59)
MICE (pooled) Black (vs. White) 1.79 (1.43 – 2.24)
MICE (pooled) Other (vs. White) 2.00 (1.51 – 2.66)
Bayesian Age (per 1 SD) 2.93 (2.62 – 3.30)
Bayesian BMI (per 1 SD) 1.92 (1.76 – 2.10)
Bayesian Male (vs. Female) 1.28 (1.06 – 1.55)
Bayesian Hispanic (vs. White) 1.82 (1.40 – 2.38)
Bayesian Black (vs. White) 1.62 (1.23 – 2.12)
Bayesian Other (vs. White) 2.11 (1.46 – 2.99)

This table summarizes results from the survey-weighted (design-based), multiple-imputation, and Bayesian models.

The Bayesian model’s credible intervals closely align with the frequentist confidence intervals but provide a more direct probabilistic interpretation of uncertainty.

Across all three frameworks (survey-weighted MLE, multiple imputation, and Bayesian), age and BMI were consistently associated with higher odds of doctor-diagnosed diabetes. Females had lower odds than males (the MICE model reports this as Female vs. Male, the reciprocal of the Male vs. Female odds ratio in the other two models), and Black and Hispanic participants had elevated odds relative to White participants. The similarity of effect sizes across frameworks underscores the robustness of these predictors to different modeling assumptions and missing-data treatments.
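Because the MICE model uses Male as the reference level for sex while the other two models use Female, its sex effect has to be inverted before it can be compared directly. A minimal sketch using the pooled MICE values from Table 5:

```r
# Putting the MICE sex effect on the same reference level as the other models
female_vs_male <- c(OR = 0.8056102, LCL = 0.6789653, UCL = 0.9558776)

# Inverting an odds ratio swaps the comparison; the interval bounds swap as well
male_vs_female <- c(
  OR  = 1 / female_vs_male[["OR"]],
  LCL = 1 / female_vs_male[["UCL"]],
  UCL = 1 / female_vs_male[["LCL"]]
)

round(male_vs_female, 2)  # OR 1.24, LCL 1.05, UCL 1.47
```

On the common Male vs. Female scale, the MICE estimate of about 1.24 sits close to the survey-weighted (1.29) and Bayesian (1.28) estimates.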

Discussion and Limitations

Interpretation

The Bayesian logistic regression framework produced results that were highly consistent with both the survey-weighted and MICE-pooled frequentist models. Age and BMI remained the most influential predictors of doctor-diagnosed diabetes, each showing a strong and positive association with diabetes risk.

Unlike classical maximum likelihood estimation, the Bayesian approach directly quantified uncertainty through posterior distributions, offering richer interpretability and more transparent probability statements. The alignment between Bayesian and design-based estimates supports the robustness of these associations and highlights the practicality of Bayesian modeling for complex, weighted population data.

Posterior predictive checks showed that simulated diabetes prevalence closely matched the observed NHANES estimate, supporting good model calibration. This agreement is consistent with the priors being weakly informative, so that inference was driven primarily by the observed data rather than by the prior specification.
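The logic of a prevalence-based posterior predictive check can be sketched as follows. This is an unweighted toy version on synthetic data, not the authors' code or the NHANES sample: the function name, the sample size, and the simulated probabilities are all hypothetical stand-ins. For each posterior draw, one replicated outcome vector is simulated and its prevalence is compared with the observed prevalence.

```python
import random

def ppc_prevalence(prob_draws, y_obs, seed=0):
    """Posterior predictive check on prevalence.

    prob_draws: list of posterior draws, each a list of predicted
                probabilities (one per participant).
    y_obs:      observed 0/1 outcomes.
    Returns the replicated prevalences and the share of replicates at or
    above the observed prevalence (a posterior predictive p-value).
    """
    rng = random.Random(seed)
    observed = sum(y_obs) / len(y_obs)
    replicated = []
    for probs in prob_draws:
        # Simulate one replicated dataset from this posterior draw
        y_rep = [1 if rng.random() < p else 0 for p in probs]
        replicated.append(sum(y_rep) / len(y_rep))
    p_value = sum(r >= observed for r in replicated) / len(replicated)
    return replicated, p_value

# Synthetic stand-in data (hypothetical, not NHANES): 500 people, 200 draws
data_rng = random.Random(1)
y_obs = [1 if data_rng.random() < 0.12 else 0 for _ in range(500)]
prob_draws = [
    [min(max(data_rng.gauss(0.12, 0.01), 0.0), 1.0)] * 500 for _ in range(200)
]
replicated, p_value = ppc_prevalence(prob_draws, y_obs)
print(f"observed = {sum(y_obs) / len(y_obs):.3f}, "
      f"replicated mean = {sum(replicated) / len(replicated):.3f}, "
      f"p = {p_value:.2f}")
```

A posterior predictive p-value far from 0 or 1, together with replicated prevalences centered near the observed value, is the pattern described above as good calibration.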

Overall, this study demonstrates that Bayesian inference complements traditional epidemiologic methods by maintaining interpretability while enhancing stability and explicitly quantifying uncertainty in complex survey data.

Limitations

While this analysis demonstrates the value of Bayesian logistic regression for epidemiologic modeling, several limitations should be acknowledged.

First, the use of a single imputed dataset for the Bayesian model—rather than full joint modeling of imputation uncertainty—may understate total variance.
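The variance concern can be made concrete with Rubin's rules, the standard pooling formulas for \(m\) imputed datasets. The symbols here are introduced for illustration and are not notation used elsewhere in this report: \(\hat{\beta}_j\) is the coefficient estimate from imputed dataset \(j\) and \(U_j\) is its estimated variance.

\[
\bar{\beta} = \frac{1}{m}\sum_{j=1}^{m}\hat{\beta}_j, \qquad
\bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{\beta}_j - \bar{\beta}\right)^2, \qquad
T = \bar{U} + \left(1 + \frac{1}{m}\right) B
\]

An analysis of a single imputed dataset reports only its own within-imputation variance \(U_1\) and omits the between-imputation term \(\left(1 + \tfrac{1}{m}\right) B\), so its intervals tend to be narrower than pooled intervals whenever the imputations differ (\(B > 0\)).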

Second, NHANES exam weights were normalized and treated as importance weights, which approximate but do not fully reproduce design-based inference.
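A minimal sketch of what normalizing weights and treating them as importance weights can mean in practice is shown below. The function names, the five raw weights, and the fitted probabilities are hypothetical; the sketch assumes normalization to a mean of 1 and a log-likelihood in which each participant's contribution is multiplied by the normalized weight.

```python
import math

def normalize_weights(weights):
    """Rescale weights so they average 1 (and therefore sum to n)."""
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]

def weighted_log_lik(y, p, weights):
    """Bernoulli log-likelihood with each term multiplied by its weight."""
    return sum(
        w * (yi * math.log(pi) + (1 - yi) * math.log(1 - pi))
        for yi, pi, w in zip(y, p, weights)
    )

# Hypothetical exam weights for five participants (not real NHANES values)
raw_w = [12000.0, 30000.0, 8000.0, 45000.0, 5000.0]
w = normalize_weights(raw_w)
y = [0, 1, 0, 1, 0]
p = [0.1, 0.4, 0.2, 0.5, 0.1]   # hypothetical fitted probabilities
print([round(v, 2) for v in w], round(weighted_log_lik(y, p, w), 3))
```

Weighting the likelihood this way reproduces weighted point estimates but ignores strata and cluster information, which is why it only approximates design-based variance estimation.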

Third, the weakly informative priors, \(N(0, 2.5)\) for slopes and \(\text{Student-}t(3, 0, 10)\) for the intercept, were not empirically tuned; alternative prior specifications could slightly alter posterior intervals.
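To give a sense of how weak the slope prior is, the sketch below converts it to the odds-ratio scale. It assumes that the 2.5 in \(N(0, 2.5)\) is a standard deviation on the log-odds scale (the usual software convention, but an assumption here); the function name is hypothetical.

```python
import math
from statistics import NormalDist

def prior_or_interval(prior_sd, level=0.95):
    """Central prior interval for an odds ratio under a N(0, prior_sd) log-odds prior."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.exp(-z * prior_sd), math.exp(z * prior_sd)

low, high = prior_or_interval(2.5)   # assumes 2.5 is a standard deviation
print(f"95% prior interval for the OR: {low:.4f} to {high:.0f}")
```

Under that assumption the prior places 95% of its mass on odds ratios between roughly 0.007 and 134, far wider than any estimate in Table 8, which is consistent with the data dominating the posterior.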

Finally, although convergence and fit diagnostics (R̂ ≈ 1, sufficient effective sample sizes, and stable posterior predictive checks) indicated good model performance, results are conditional on the 2013–2014 NHANES cycle and may not generalize to later datasets or longitudinal analyses.
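For readers unfamiliar with R̂, the sketch below computes the classic (non-split) Gelman-Rubin statistic for one parameter from several chains. It is an illustration only: current MCMC software typically reports a rank-normalized split version, and the chains here are simulated rather than taken from the fitted model.

```python
import random
from statistics import mean, variance

def gelman_rubin_rhat(chains):
    """Classic (non-split) potential scale reduction factor for one parameter.

    chains: list of equal-length lists of posterior draws, one list per chain.
    """
    n = len(chains[0])
    within = mean(variance(c) for c in chains)           # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])    # B: n times variance of chain means
    var_plus = (n - 1) / n * within + between / n        # pooled variance estimate
    return (var_plus / within) ** 0.5

# Simulated chains (hypothetical): four that agree, then one shifted away
rng = random.Random(0)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
stuck = mixed[:3] + [[x + 3.0 for x in mixed[3]]]
rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(f"well-mixed: {rhat_mixed:.3f}, one chain shifted: {rhat_stuck:.3f}")
```

Values close to 1 indicate that between-chain and within-chain variability agree; a chain exploring a different region pushes the statistic well above 1.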

Conclusion

The Bayesian, survey-weighted, and imputed logistic regression frameworks all identified consistent predictors of diabetes risk in U.S. adults: advancing age, higher BMI, sex (lower odds for females), and non-White race/ethnicity.

The Bayesian model produced estimates nearly identical in direction and magnitude to the frequentist results while providing a more comprehensive assessment of uncertainty through posterior distributions and credible intervals.

These consistent findings across modeling frameworks underscore the robustness of core risk factors and support the use of Bayesian inference for epidemiologic research involving complex survey data.

By incorporating prior information and using MCMC to sample from the full posterior distribution, Bayesian inference enhances model transparency and interpretability.

Its agreement with traditional approaches underscores that Bayesian methods can be applied confidently in large-scale public health datasets such as NHANES.

Future extensions could integrate hierarchical priors, multiple NHANES cycles, or Bayesian model averaging to better capture population heterogeneity, temporal trends, and evolving diabetes risk patterns.

References

Abdullah, H., R. Hassan, and B. Mustafa. 2022. “A Review on Bayesian Deep Learning in Healthcare: Applications and Challenges.” Artificial Intelligence in Medicine 128: 102312. https://doi.org/10.1016/j.artmed.2022.102312.
Austin, P. C., I. R. White, D. S. Lee, and S. van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.” Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.
Baldwin, S. A., and M. J. Larson. 2017. “An Introduction to Using Bayesian Linear Regression with Clinical Data.” Behaviour Research and Therapy 98: 58–75. https://doi.org/10.1016/j.brat.2017.05.014.
Buuren, S. van. 2012. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC. https://doi.org/10.1201/b11826.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “Mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Chatzimichail, T., and A. T. Hatjimihail. 2023. “A Bayesian Inference-Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19): 3135. https://doi.org/10.3390/diagnostics13193135.
Gelman, A., A. Jakulin, M. G. Pittau, and Y. S. Su. 2008. “A Weakly Informative Default Prior Distribution for Logistic and Other Regression Models.” Annals of Applied Statistics 2 (4): 1360–83. https://doi.org/10.1214/08-AOAS191.
Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14 (4): 382–417. https://doi.org/10.1214/ss/1009212519.
Klauenberg, K., G. Wübbeler, B. Mickan, P. Harris, and C. Elster. 2015. “A Tutorial on Bayesian Normal Linear Regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Kruschke, J. K., and T. M. Liddell. 2017. “Bayesian Data Analysis for Newcomers.” Psychonomic Bulletin & Review 25 (1): 155–77. https://doi.org/10.3758/s13423-017-1272-1.
Leeuw, C. de, and I. Klugkist. 2012. “Augmenting Data with Published Results in Bayesian Linear Regression.” Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Y. M., S. L. S. Chen, A. M. F. Yen, and H. H. Chen. 2013. “Individual Risk Prediction Model for Incident Cardiovascular Disease: A Bayesian Clinical Reasoning Approach.” International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/j.ijcard.2012.05.016.
Luo, C., X. Sun, Y. Zhao, and H. Guo. 2024. “Variable Selection and Model Averaging in Bayesian Additive Regression Trees: A Comparative Study.” Journal of Computational and Graphical Statistics. https://doi.org/10.1080/10618600.2024.2401234.
Momeni, F., F. Momeni, A. Moradi, M. Shabani, and B. Amani. 2021. “Bayesian State-Space Modeling of Dynamic COVID-19 Risk Prediction Using Time-Varying Biomarkers.” Scientific Reports 11 (1): 20387. https://doi.org/10.1038/s41598-021-99711-7.
National Center for Health Statistics (NCHS). 2014. “National Health and Nutrition Examination Survey (NHANES) 2013–2014 Data Documentation, Codebook, and Frequencies.” U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/overview.aspx?BeginYear=2013.
Richardson, Robert, and Brian Hartman. 2018. “Bayesian Nonparametric Regression Models for Modeling and Predicting Healthcare Claims.” Insurance: Mathematics and Economics 83: 1–8. https://doi.org/10.1016/j.insmatheco.2018.06.002.
Schoot, Rens van de, David Kaplan, Jaap Denissen, Jens Asendorpf, Franz Neyer, and Marcel van Aken. 2013. “A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research.” European Journal of Developmental Psychology 10 (6): 723–49. https://doi.org/10.1080/17405629.2013.803373.
Schoot, R. van de, S. Depaoli, R. King, B. Kramer, K. Märtens, M. G. Tadesse, M. Vannucci, et al. 2021. “Bayesian Statistics and Modelling.” Nature Reviews Methods Primers 1: 1–26. https://doi.org/10.1038/s43586-020-00001-2.
Zeger, S. L., Z. Wu, Y. Coley, A. T. Fojo, B. Carter, K. O’Brien, P. Zandi, et al. 2020. “Using a Bayesian Approach to Predict Patients’ Health and Response to Treatment.” 272. Johns Hopkins Biostatistics Working Paper Series. https://biostats.bepress.com/jhubiostat/paper272.